Affordable Housing Distribution vs Low-Income Population

2.1 Distribution Comparison

The relationship between affordable housing distribution and low-income population distribution is a critical starting point for this analysis. Affordable housing is intended to address the needs of low-income populations, so comparing these distributions helps to identify whether housing resources are effectively located to serve those who need them most. Any misalignment between the two could indicate gaps in accessibility or inequities in resource allocation.

Code
import pandas as pd
import geopandas as gpd
import hvplot.pandas
import holoviews as hv
from shapely.geometry import Point
hv.extension('bokeh')

# Load datasets
affordable_housing_data = pd.read_csv("./geographicdatascience_python_finalproject/database/Affordable_Housing_Production_by_Building_20241224.csv")
income_data = pd.read_csv("./geographicdatascience_python_finalproject/database/Community_Development_Block_Grant__CDBG__Eligibility_by_Census_Tract_-_CSV_20241221.csv")
geo_data = gpd.read_file("./geographicdatascience_python_finalproject/database/nyc_tracts.json")

# Build point geometries from longitude/latitude and wrap them in a GeoDataFrame
affordable_housing_gdf = gpd.GeoDataFrame(
    affordable_housing_data,
    geometry=gpd.points_from_xy(
        affordable_housing_data["Longitude"], affordable_housing_data["Latitude"]
    ),
    crs="EPSG:4326",
)

# Ensure CRS matches between GeoDataFrames
affordable_housing_gdf = affordable_housing_gdf.to_crs(geo_data.crs)

# Perform spatial join between affordable housing data and census tracts
affordable_housing_with_tracts = gpd.sjoin(affordable_housing_gdf, geo_data, how="left", predicate="intersects")

# Summarize affordable housing count by census tract
housing_summary_by_tract = (
    affordable_housing_with_tracts.groupby("BoroCT2020")["Building ID"]
    .count()
    .reset_index()
)
housing_summary_by_tract.rename(columns={"Building ID": "AffordableHousingCount"}, inplace=True)

# Merge affordable housing summary into geo_data
geo_data = geo_data.merge(housing_summary_by_tract, on="BoroCT2020", how="left")
geo_data["AffordableHousingCount"] = geo_data["AffordableHousingCount"].fillna(0)

# Match income data with geo_data using BoroCT fields
income_data["BoroCT"] = income_data["BoroCT"].astype(str).str.zfill(11)
geo_data["BoroCT2020"] = geo_data["BoroCT2020"].astype(str).str.zfill(11)

# Merge low-income population data into geo_data
geo_data = geo_data.merge(
    income_data[["BoroCT", "LowMod_Population"]], left_on="BoroCT2020", right_on="BoroCT", how="left"
)
geo_data["LowMod_Population"] = geo_data["LowMod_Population"].fillna(0)

# Map Affordable Housing Distribution
map_affordable_housing = geo_data.hvplot.polygons(
    "geometry",
    color="AffordableHousingCount",  # Map AffordableHousingCount to color
    cmap="Reds",
    line_color="white",
    hover_cols=["BoroCT2020", "AffordableHousingCount"],  # Display BoroCT2020 and AffordableHousingCount on hover
    title="Affordable Housing Distribution by Census Tract",
    aspect='equal',
    clim=(0, 50),  # Set color range between 0 and 50
    clipping_colors={'max': 'darkred'},  # Values above 50 will be dark red
    colorbar=True
)

# Map Low-Income Population Overlay
map_low_income = geo_data.hvplot.polygons(
    "geometry",
    color="LowMod_Population",  # Map LowMod_Population to color
    cmap="Greens",
    line_color="white",
    hover_cols=["BoroCT2020", "LowMod_Population"],  # Display BoroCT2020 and LowMod_Population on hover
    title="Low-Income Population by Census Tract",
    aspect='equal',
    colorbar=True  # Display color bar
)

# Combine Maps for Visualization
(map_affordable_housing + map_low_income).cols(1)
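
Because unmatched tracts are filled with 0 after each merge, a key mismatch between `BoroCT2020` and `BoroCT` would look the same as a tract with no low-income residents. The following is a minimal sketch of a match-rate check using the `indicator=True` option of pandas `merge`. The two small frames and their values are invented for illustration and only stand in for `geo_data` and `income_data`.

```python
import pandas as pd

# Toy stand-ins for geo_data and income_data (values are invented)
tracts = pd.DataFrame({"BoroCT2020": ["1000100", "1000201", "2000400"]})
income = pd.DataFrame({
    "BoroCT": [1000100, 2000400],  # integer keys, as a CSV reader might parse them
    "LowMod_Population": [1200, 850],
})

# Normalize both keys to strings of the same width before merging
tracts["BoroCT2020"] = tracts["BoroCT2020"].astype(str).str.zfill(11)
income["BoroCT"] = income["BoroCT"].astype(str).str.zfill(11)

# indicator=True adds a _merge column recording where each row was found
checked = tracts.merge(
    income, left_on="BoroCT2020", right_on="BoroCT", how="left", indicator=True
)

# Count matched vs. unmatched rows and list the tracts with no income record
match_counts = checked["_merge"].value_counts()
unmatched = checked.loc[checked["_merge"] == "left_only", "BoroCT2020"].tolist()
print(match_counts)
print("Tracts without income data:", unmatched)
```

If a large share of rows come back as `left_only`, the zero-filled values on the map more likely reflect a key or vintage mismatch than a real absence of low-income population.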

2.2 K-Means Cluster Analysis for Income and Affordable Housing

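First, we cleaned the data and excluded outliers (tracts lying more than 1.5 times the interquartile range beyond the quartiles on either feature) so that extreme values would not distort the clusters. Then, based on the elbow method, we chose k=3 as the number of clusters.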

Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import hvplot.pandas
plt.style.use('default')

# Step 1: Prepare the Data
# Select features for clustering
clustering_data = geo_data[["AffordableHousingCount", "LowMod_Population"]].copy()

# Step 2: Identify and Remove Outliers
# Calculate IQR for both features
Q1 = clustering_data.quantile(0.25)
Q3 = clustering_data.quantile(0.75)
IQR = Q3 - Q1

# Define outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out outliers
non_outliers = ~((clustering_data < lower_bound) | (clustering_data > upper_bound)).any(axis=1)

# Filter geo_data to only include non-outliers
geo_data_filtered = geo_data[non_outliers].copy()
clustering_data_filtered = clustering_data[non_outliers].copy()

# Step 3: Normalize the Filtered Data
scaler = StandardScaler()
clustering_data_scaled_filtered = scaler.fit_transform(clustering_data_filtered)

# Step 4: Determine Optimal Number of Clusters
inertia = []
for n_clusters in range(2, 10):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(clustering_data_scaled_filtered)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Curve
plt.figure(figsize=(8, 5))
plt.plot(range(2, 10), inertia, marker='o')
plt.title("Elbow Method for Optimal Clusters (Filtered Data)")
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.grid(True)
plt.show()

Code
# Step 5: Fit KMeans Model with Optimal Clusters
optimal_clusters = 3  # Adjust this value based on the Elbow Curve
kmeans_filtered = KMeans(n_clusters=optimal_clusters, random_state=42)
geo_data_filtered["Cluster"] = kmeans_filtered.fit_predict(clustering_data_scaled_filtered)
geo_data_filtered["Cluster"] = geo_data_filtered["Cluster"].astype(str)  # Convert to string for better visualization

# Step 6: Visualize Clustering Results
cluster_map_filtered = geo_data_filtered.hvplot.polygons(
    "geometry",
    color="Cluster",  # Use Cluster column for coloring
    cmap={  # Map each cluster label to a color; keys are strings because Cluster was cast to str
        "0": "#fffbdf",
        "1": "#ff6f64",
        "2": "#ffb164",
    },
    line_color="white",
    hover_cols=["AffordableHousingCount", "LowMod_Population", "Cluster"],
    title="Clustering Analysis of Affordable Housing and Low-Income Population (Filtered)",
    aspect='equal',
    colorbar=False
)

# Display the filtered map
cluster_map_filtered

The cluster map shows a mix of dispersion and aggregation: tracts with the same cluster label often form contiguous groups, which indicates that affordable housing and low-income populations tend to be spatially concentrated within New York City.
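
One way to ground this reading is to profile each cluster by its tract count and mean feature values. The sketch below uses a small invented table in place of `geo_data_filtered`. The column names match those used above, but the numbers are illustrative only and are not results from this analysis.

```python
import pandas as pd

# Invented example rows standing in for geo_data_filtered
tracts = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 2, 2],
    "AffordableHousingCount": [0, 1, 6, 8, 1, 2],
    "LowMod_Population": [500, 700, 3000, 3400, 2800, 3200],
})

# Tract count and mean of each feature per cluster
cluster_profile = tracts.groupby("Cluster").agg(
    tracts=("AffordableHousingCount", "size"),
    mean_housing=("AffordableHousingCount", "mean"),
    mean_low_mod=("LowMod_Population", "mean"),
)
print(cluster_profile)
```

A table like this makes it easier to name the clusters (for example, "high need, low supply") instead of referring to them by arbitrary label numbers.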

Code
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Step 1: Ensure the Cluster column is of integer type
geo_data_filtered["Cluster"] = geo_data_filtered["Cluster"].astype(int)

# Step 2: Define colors for clusters and create a mapping
cluster_colors_filtered = {
    0: "#fffbdf",
    1: "#ff6f64",
    2: "#ffb164"
}

# Step 3: Convert cluster centers back to the original feature scale
# (reuse the scaler fitted before clustering rather than fitting a new one)
centers_original = scaler.inverse_transform(kmeans_filtered.cluster_centers_)

# Step 4: Plot with explicit color mapping
plt.figure(figsize=(8, 5))
for cluster_id in range(optimal_clusters):
    cluster_points = geo_data_filtered[geo_data_filtered["Cluster"] == cluster_id]
    plt.scatter(
        cluster_points["AffordableHousingCount"],
        cluster_points["LowMod_Population"],
        label=f"Cluster {cluster_id}",
        color=cluster_colors_filtered[cluster_id],  # Use dictionary mapping
        alpha=0.8
    )


# Step 5: Plot cluster centers in the original feature space
plt.scatter(
    centers_original[:, 0],  # AffordableHousingCount
    centers_original[:, 1],  # LowMod_Population
    c="black",
    marker="X",
    s=250,  # Size of cluster center markers
    label="Cluster Centers"
)

# Step 6: Customize the plot
plt.title("KMeans Clustering (Original Feature Space - Filtered)", fontsize=18)
plt.xlabel("Affordable Housing Count", fontsize=14)
plt.ylabel("Low-Income Population", fontsize=14)
plt.grid(color='gray', linestyle='--', linewidth=0.5)

# Step 7: Adjust legend position
plt.legend(title="Clusters", fontsize=12, loc='upper right', bbox_to_anchor=(1.15, 1))

# Step 8: Display the plot
plt.tight_layout()
plt.show()

The scatter plot shows that tracts fall into three groups, separated by the combined effect of low-income population and affordable housing count. Next, we calculated the linear (Pearson) correlation between low-income population and affordable housing count, with the following results.

Code
from scipy.stats import pearsonr

# Use the filtered data for correlation analysis
correlation_filtered, p_value_filtered = pearsonr(
    geo_data_filtered["AffordableHousingCount"], geo_data_filtered["LowMod_Population"]
)
print(f"Filtered Correlation: {correlation_filtered}, P-value: {p_value_filtered}")

Results and Interpretation

  • Filtered Correlation: 0.30 (weak positive correlation)
  • P-value: 2.57e-43 (statistically significant)
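
To put the coefficient in perspective, squaring it gives the share of variance in one variable that a linear fit on the other would account for: 0.30 squared is 0.09, or roughly 9%. Because tract-level counts are often skewed, a rank-based coefficient can also serve as a robustness check. The sketch below shows both calculations. The synthetic arrays are invented for illustration and are not the project data, so the printed coefficients say nothing about New York City.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Share of variance explained by a linear fit at the reported r
r_reported = 0.30
r_squared = r_reported ** 2
print(f"r^2 = {r_squared:.2f}")

# Synthetic, skewed stand-in data (NOT the project data)
rng = np.random.default_rng(0)
low_mod_population = rng.gamma(shape=2.0, scale=1000.0, size=500)
housing_count = rng.poisson(lam=0.5 + low_mod_population / 2000.0)

# Pearson measures linear association; Spearman measures monotonic association on ranks
pearson_r, pearson_p = pearsonr(housing_count, low_mod_population)
spearman_r, spearman_p = spearmanr(housing_count, low_mod_population)
print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_r:.2f}")
```

If the two coefficients diverge noticeably on the real data, that would suggest the relationship is monotonic but not linear, or that a few extreme tracts are driving the Pearson value.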

Conclusion

Surprisingly, the data show only a weak, though statistically significant, positive relationship between low-income population and affordable housing count. This suggests some alignment between housing need and supply, but the weak correlation indicates that other factors, such as zoning laws, policy gaps, or spatial mismatches, may dilute the relationship. Our next step is therefore to explore these additional factors.